ABSTRACT

Promising results have been achieved in image classification
problems by exploiting the discriminative power of sparse
representations for classification (SRC). Recently, it has been
shown that the use of class-specific spike-and-slab priors in
conjunction with the class-specific dictionaries from SRC is
particularly effective in low-training scenarios. As a logical
extension, we build on this framework for multitask scenarios,
wherein multiple representations of the same physical
phenomenon are available. We experimentally demonstrate
the benefits of mining joint information from different camera
views for multi-view face recognition.

Index Terms — Image analysis, sparse representation, 
structured priors, spike-and-slab, face recognition. 

1. INTRODUCTION 

Image classification is an important problem that has been
studied for over three decades. Practical applications span a
wide variety of areas such as general texture or object
categorization [1, 2], face recognition [3, 4] and automatic target
recognition in hyperspectral or radar imagery [5, 6]. Various
methods of feature extraction ([7, 8], for example) as well as
classifiers [9] have been investigated.

The advent of compressive sensing (CS) [10] has inspired
research on applying the central analytical formulation of CS
to classification problems. Sparse representation-based
classification (SRC) [11] is arguably the best-known such tool
and has demonstrated robust performance even in the presence
of high pixel distortion, occlusion or noise. Extensions of SRC
have been proposed along two lines of thought: (i) adding
regularizers and priors, which prevent overfitting by introducing
additional information into the problem, and (ii) exploiting joint
information and complementary data in multitask cases. Our
work follows both lines, combining the advantages of priors
with those of joint information.

Copyright ©2010 IEEE. Personal use of this material is permitted. 
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending a request to pubs-permissions@ieee.org.


Yuanming Suo, M. Dao, T. D. Tran 

Dept. of Electrical and Computer Engineering
The Johns Hopkins University 


Motivation and Contribution: Advances in sensing technology
have facilitated the easy acquisition of multiple different
measurements of the same underlying physical phenomenon.
Often there is complementary information embedded in these
different measurements which can be exploited for improved
performance. For example, in face recognition or action
recognition we could have different views of a person's face
captured under different illumination conditions or with
different facial postures [12]-[15]. In automatic target
recognition, multiple SAR (synthetic aperture radar) views
are acquired [16]. The use of complementary information
from different color image channels in medical imaging has
been demonstrated in [17]. In border security applications,
multi-modal sensor data such as voice sensor measurements,
infrared images and seismic measurements are fused [18] for
activity classification tasks. The prevalence of such a rich
variety of applications, where multi-sensor information
manifests in different ways, is a key motivation for our
contribution in this paper. Specifically, we extend recent work
in class-specific sparse prior-based classification by Srinivas et
al. [19] to a multitask framework.

We extend the Bayesian framework in [19] in a hierarchical
manner in order to capture joint information across multiple
tasks (or measurements). As observed in [19], an important
challenge is to develop a framework based on spike-and-slab
priors that can capture a general notion of signal sparsity
while also leading to tractable optimization problems. Our
contribution addresses both of these issues via a generalized
collaborative Bayesian hierarchical method. As expected, it
results in a hard non-convex optimization problem. We propose
an efficient solution using a Markov chain Monte Carlo (MCMC)
method, which yields practical benefits as demonstrated in
Section 4.

2. SPARSE REPRESENTATION-BASED 
CLASSIFICATION (SRC) 

The aim of CS is essentially to recover a higher-dimensional
compressible (sparse) signal $x \in \mathbb{R}^n$ from a set of
lower-dimensional linear measurements $y \in \mathbb{R}^m$ of the
form $y = Ax$. It has been shown that $x$ is recoverable by solving
the following underdetermined ($m \ll n$) optimization problem [20]:

$$\min_x \|x\|_0 \quad \text{subject to} \quad y = Ax, \tag{1}$$

where $\|x\|_0$ is the number of non-zero elements of $x$.
Problem (1) is known to be NP-hard, but it can be addressed
by relaxing the non-convex term [21] and solving the following
optimization problem in the presence of noise:

$$\min_x \|x\|_1 \quad \text{subject to} \quad \|y - Ax\|_2 < \epsilon. \tag{2}$$
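
As a minimal sketch of how an $\ell_1$ relaxation of this kind is commonly handled in practice, the Python snippet below solves the unconstrained Lagrangian form, $\min_x \tfrac{1}{2}\|y - Ax\|_2^2 + \tau\|x\|_1$, with the iterative shrinkage-thresholding algorithm (ISTA). ISTA is a standard generic solver and is not claimed to be the method used in this paper; the function names, the weight `tau`, the iteration count and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Proximal operator of thresh * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def ista(A, y, tau, n_iter=3000):
    """Minimize 0.5 * ||y - A x||_2^2 + tau * ||x||_1 by ISTA."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)              # gradient of the quadratic term
        x = soft_threshold(x - step * grad, step * tau)
    return x

# Synthetic underdetermined system (m < n) with a 3-sparse ground truth (assumed data).
rng = np.random.default_rng(0)
m, n = 50, 100
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[5, 37, 80]] = [3.0, -2.5, 2.0]
y = A @ x_true

x_hat = ista(A, y, tau=0.05)
support = np.sort(np.argsort(-np.abs(x_hat))[:3])  # indices of the 3 largest entries
print("estimated support:", support)
```

The weight `tau` plays the role of the tolerance $\epsilon$ in the constrained form: a larger `tau` yields a sparser but more biased estimate.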

Recently, Wright et al. proposed SRC [11] for face recognition
by designing an over-complete class-specific dictionary A and
employing the CS framework. For a multi-class problem, the
dictionary matrix A is built using training images from all
classes as $A = [A_1 \dots A_C]$, where each column of $A_r$ is a
vectorized training image (dictionary atom) from class $C_r$.

The minimization problem in (2) is equivalent to maximizing
the probability of observing the sparse vector $x$ given $y$,
assuming $x$ has an i.i.d. Laplacian distribution in a Bayesian
framework [22]. Therefore, sparsity can be interpreted as a
prior on the coefficient vector which enhances signal
comprehension by incorporating contextual information and signal
structure. Both structure and sparsity can be enforced by
introducing a probabilistic prior, $f(x)$, on the coefficient
vector and solving the following optimization problem [23]:

$$\max_x f(x) \quad \text{subject to} \quad \|y - Ax\|_2 < \epsilon. \tag{3}$$

Recently, Srinivas et al. proposed the use of spike-and-slab
priors under a Bayesian framework for image classification
[19]. They proposed one class-conditional probability density
function (pdf) per class, denoted by $f_{C_1}$ and $f_{C_2}$,
with the same class-specific dictionary A as before.
Given a test vector $y$, class assignment is done by solving the
following constrained likelihood maximization problem per
class, where the $f_{C_r}$'s are learned separately using
spike-and-slab priors:

$$x^*_{C_r} = \arg\max_x f_{C_r}(x) \quad \text{subject to} \quad \|y - Ax\|_2 < \epsilon, \tag{4}$$

$$\text{Class}(y) = \arg\max_{r \in \{1, 2\}} f_{C_r}(x^*_{C_r}). \tag{5}$$

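Inspired by [24], they developed a Bayesian framework for
classification and considered a linear model $y = Ax + \mathbf{n}$
in each class, where $y \in \mathbb{R}^m$, $x \in \mathbb{R}^n$,
$A \in \mathbb{R}^{m \times n}$ and $\mathbf{n}$ is the inherent
Gaussian noise (we drop the class indices for notational
simplicity). With this set-up, the underlying Bayesian framework
is as follows:

$$y \mid A, x, \gamma, \sigma_n^2 \sim \mathcal{N}(Ax, \sigma_n^2 I) \tag{6}$$

$$x_i \mid \sigma^2, \gamma_i, \lambda \sim \gamma_i\,\mathcal{N}(0, \sigma^2\lambda^{-1}) + (1 - \gamma_i)\,\delta_0, \quad i = 1, \dots, n \tag{7}$$

$$\gamma_i \mid \kappa \sim \text{Bernoulli}(\kappa), \quad i = 1, \dots, n, \tag{8}$$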

where $\mathcal{N}(\cdot)$ represents the normal distribution and (7)
models each $x_i$ with a spike-and-slab prior, which is a
well-suited structured prior for capturing sparsity [25]-[27].
Here $\delta_0$ is a point mass concentrated at zero (known as the
"spike"), and the other term is the distribution of the nonzero
coefficients of the sparse vector (known as the "slab"). In this
framework the $\gamma_i$'s are binary, taking the value 1 if
$x_i \neq 0$ and 0 if $x_i = 0$, so that they act as indicator
variables for the elements of $x$. Clearly, $\gamma$ controls the
sparsity level of the signal and at the same time can enforce a
specific structure. As a result, $\kappa$, the probability that
each coefficient is nonzero, plays a key role in identifying the
structure. In the next section, we show how a careful choice of
$\kappa$ leads to a framework that is more general and able to
capture different sparse structures in the coefficient vector.

Fig. 1. Row, block and dynamic sparsity from left to right.
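
To make the role of the indicator variables concrete, the following Python sketch draws a coefficient vector from a spike-and-slab prior of the kind described above: each indicator is Bernoulli with activation probability `kappa`, active coefficients are drawn from a zero-mean Gaussian slab with variance `sigma2 / lam`, and inactive ones are exactly zero. The function name and all parameter values are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def sample_spike_and_slab(n, kappa, sigma2, lam, rng):
    """Draw (x, gamma): gamma_i ~ Bernoulli(kappa), x_i from slab or spike."""
    gamma = (rng.random(n) < kappa).astype(int)       # indicator variables
    slab = rng.normal(0.0, np.sqrt(sigma2 / lam), n)  # N(0, sigma^2 / lambda) draws
    x = gamma * slab                                  # exact zeros where gamma_i = 0
    return x, gamma

rng = np.random.default_rng(0)
x, gamma = sample_spike_and_slab(n=10_000, kappa=0.1, sigma2=1.0, lam=0.5, rng=rng)
print("fraction of non-zero coefficients:", np.mean(x != 0))
```

The empirical fraction of non-zero coefficients concentrates around `kappa`, which is why the activation probability directly sets the expected sparsity level.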


3. MULTI-TASK IMAGE CLASSIFICATION VIA 
COLLABORATIVE, HIERARCHICAL 
SPIKE-AND-SLAB PRIORS 

3.1. Bayesian Framework 

If there are multiple measurements $y_1, \dots, y_T$ of the same
signal, we now have a sparse coefficient matrix $X$ as follows:

$$Y := [y_1 \dots y_T] = A[x_1 \dots x_T] = AX. \tag{9}$$

Depending on the application and type of measurement,
different notions of matrix sparsity can occur with respect to the
coefficient matrix $X$. As illustrated in Fig. 1, in some
applications such as hyperspectral classification, row sparsity
(entire rows with all zero or all non-zero coefficients) emerges
naturally in matrix $X$, whereas in other applications, block
sparsity or joint dynamic sparsity are more appropriate. This
sparsity pattern is an inherent feature of the application and
of the specific way in which the multi-task dictionary has been
designed. Our contribution is a framework that is applicable
in a wide variety of applications by capturing a general notion
of the sparse structure in the coefficient matrix.

Assume we have T observations (tasks) from the same
class and stack them to form a matrix $Y$. We seek a sparse
matrix $X$ such that the error between the test matrix and its
reconstruction is small while $X$ has a sparse structure.
We modify (6)-(8) in order to generalize the Bayesian
framework to the multitask case. The modified version should
be able to model the behaviour of $X$ and induce the desired
structure and sparsity in $X$. The generalized Bayesian framework
for the multitask case is as follows:

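$$Y \mid A, X, \Gamma, \sigma_n^2 \sim \prod_{t=1}^{T} \mathcal{N}(Ax_t, \sigma_n^2 I) \tag{10}$$

$$X \mid \sigma^2, \Gamma, \lambda \sim \prod_{t=1}^{T}\prod_{i=1}^{n} \left[\gamma_{ti}\,\mathcal{N}(0, \sigma^2\lambda^{-1}) + (1 - \gamma_{ti})\,\delta_0\right] \tag{11}$$

$$\Gamma \mid K \sim \prod_{t=1}^{T}\prod_{i=1}^{n} \text{Bernoulli}(\kappa_{ti}), \tag{12}$$

where $x_t$ is the $t$-th column of the matrix $X$, $K = \{\kappa_{ti}\}_{t,i}$ and
$\Gamma = \{\gamma_{ti}\}_{t,i}$ for $t = 1, \dots, T$ and $i = 1, \dots, n$.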
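
The following Python sketch illustrates how per-task, per-coefficient activation probabilities can encode different sparsity structures in the coefficient matrix. It samples the indicators and coefficients entry by entry, with one activation probability per entry stored in an array `K` of the same shape as $X$ (rows index atoms, columns index tasks). The function name, the two example patterns and all numeric values are assumptions chosen only for illustration.

```python
import numpy as np

def sample_coefficient_matrix(K, sigma2, lam, rng):
    """Draw (X, Gamma); K[i, t] is the activation probability of atom i in task t."""
    Gamma = (rng.random(K.shape) < K).astype(int)
    X = Gamma * rng.normal(0.0, np.sqrt(sigma2 / lam), K.shape)
    return X, Gamma

rng = np.random.default_rng(1)
n, T = 12, 3

# Row-sparse pattern: the same few atoms are likely to be active in every task.
K_row = np.full((n, T), 0.02)
K_row[[2, 7], :] = 0.95

# Task-dependent pattern: each task favours a different atom.
K_dyn = np.full((n, T), 0.02)
K_dyn[[1, 4, 9], [0, 1, 2]] = 0.95

X_row, Gamma_row = sample_coefficient_matrix(K_row, sigma2=1.0, lam=0.5, rng=rng)
X_dyn, Gamma_dyn = sample_coefficient_matrix(K_dyn, sigma2=1.0, lam=0.5, rng=rng)
print(Gamma_row.T)
print(Gamma_dyn.T)
```

A single shared activation probability could not distinguish these two patterns, whereas an entry-wise choice can favour either one.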

Remark: Note that, in contrast to (8) as proposed in [19],
$\kappa$ is assumed to take different values for each task and each
coefficient. With this assumption our framework can capture
more general notions of structure and sparsity in the matrix
$X$. This is one of our central analytical contributions in this
paper, in contrast with methods that rely on relaxed and
simplified assumptions [19], [28].

The benefit of the Bayesian approach is that it alleviates
the requirement for abundant training data. One of the central
assumptions in SRC is the existence of sufficient training
samples (an overcomplete dictionary A), whereas the proposed
Bayesian approach can handle scenarios with limited training
data. This is enabled by the use of class-specific priors, which
offer more discriminability in the dictionary.

To perform Bayesian inference, we follow a procedure similar
to that in [19] to obtain the joint posterior density function
for MAP estimation. Here we present only the resulting
optimization problem; a detailed discussion is available online
in a technical report [29]. Note that we have a different
framework for each class and, as a result, we obtain C different
optimization problems to find $X^*_{C_r}$ and $\Gamma^*_{C_r}$:

$$(X^*_{C_r}, \Gamma^*_{C_r}) = \arg\min_{X, \Gamma} \; \frac{\sigma^2}{\sigma_n^2}\|Y - AX\|_F^2 + \lambda\|X\|_F^2 + \sum_{t=1}^{T}\sum_{i=1}^{n} \gamma_{ti}\rho_{ti}, \tag{13}$$

where $\rho_{ti} = \sigma^2 \log\left(\frac{2\pi\sigma^2(1 - \kappa_{ti})^2}{\lambda\kappa_{ti}^2}\right)$.
The first term in (13) minimizes the reconstruction error, whereas
the second and third terms jointly keep the coefficient matrix
smooth and sparse.
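
As a small numerical illustration of how the activation probability enters the sparsity term, the Python sketch below evaluates the per-coefficient weight under the assumption that it has the form $\rho = \sigma^2 \log\left(2\pi\sigma^2(1-\kappa)^2 / (\lambda\kappa^2)\right)$. The function name and the parameter values are illustrative assumptions.

```python
import numpy as np

def rho(kappa, sigma2, lam):
    """Weight rho = sigma^2 * log(2*pi*sigma^2*(1-kappa)^2 / (lam*kappa^2))."""
    kappa = np.asarray(kappa, dtype=float)
    return sigma2 * np.log(2.0 * np.pi * sigma2 * (1.0 - kappa) ** 2 / (lam * kappa ** 2))

kappas = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
penalties = rho(kappas, sigma2=1.0, lam=1.0)
for k, p in zip(kappas, penalties):
    print(f"kappa = {k:.2f} -> rho = {p:+.3f}")
```

Under this form the weight decreases monotonically in `kappa`: atoms with a small prior activation probability pay a large cost for being switched on, while atoms with a probability close to one receive a negative weight, which rewards their activation.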

Remark: A solution to this optimization problem has not yet been
addressed in the literature, but its relaxations reduce to
well-known problems in compressive sensing and statistics
such as LASSO, Elastic Net, etc. [30].

Finally, after solving the optimization problem for each class
for a given test matrix $Y$, the class assignment is as follows:

$$\text{Class}(Y) = \arg\min_r L_r(X^*_{C_r}, \Gamma^*_{C_r}), \tag{14}$$

where $L_r$ denotes the cost in (13) evaluated for class $C_r$.


3.2. Solution to Optimization Problem 

In this section we provide an efficient solution to the
optimization problem in (13). First, we rewrite the problem
as follows:

$$\arg\min_{X, \Gamma} \sum_{t=1}^{T}\left(\frac{\sigma^2}{\sigma_n^2}\|y_t - Ax_t\|_2^2 + \lambda\|x_t\|_2^2 + \sum_{i=1}^{n}\gamma_{ti}\rho_{ti}\right). \tag{15}$$

As can be seen, (15) consists of T independent optimization
problems, which can be solved individually rather than as a
single complex matrix optimization problem. Hence, for each
class we follow this procedure:

• Minimize $\frac{\sigma^2}{\sigma_n^2}\|y_t - Ax_t\|_2^2 + \lambda\|x_t\|_2^2 + \sum_{i=1}^{n}\gamma_{ti}\rho_{ti}$ over $(x_t, \gamma_t)$, for all $t$.

• Put the $x_t$'s and $\gamma_t$'s together to form $X$ and $\Gamma$, respectively.
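
The per-task decoupling can be checked numerically. The Python sketch below assumes the cost consists of a weighted Frobenius-norm data-fit term, a ridge term and an indicator-weighted penalty, and verifies that the matrix-form cost equals the sum of the per-task costs. The function names, the weight formula used in `rho` and the random test data are assumptions for illustration only.

```python
import numpy as np

def rho(K, sigma2, lam):
    """Assumed penalty weight: sigma^2 * log(2*pi*sigma^2*(1-K)^2 / (lam*K^2))."""
    return sigma2 * np.log(2.0 * np.pi * sigma2 * (1.0 - K) ** 2 / (lam * K ** 2))

def joint_cost(Y, A, X, Gamma, K, sigma2, sigma_n2, lam):
    """Matrix form: weighted Frobenius fit + ridge term + activation penalty."""
    fit = (sigma2 / sigma_n2) * np.linalg.norm(Y - A @ X, "fro") ** 2
    return fit + lam * np.linalg.norm(X, "fro") ** 2 + np.sum(Gamma * rho(K, sigma2, lam))

def task_cost(y, A, x, gamma, kappa, sigma2, sigma_n2, lam):
    """Per-task form: the same three terms for a single column."""
    fit = (sigma2 / sigma_n2) * np.sum((y - A @ x) ** 2)
    return fit + lam * np.sum(x ** 2) + np.sum(gamma * rho(kappa, sigma2, lam))

rng = np.random.default_rng(0)
m, n, T = 8, 5, 3
A = rng.normal(size=(m, n))
Gamma = rng.integers(0, 2, size=(n, T))
X = Gamma * rng.normal(size=(n, T))
Y = A @ X + 0.1 * rng.normal(size=(m, T))
K = rng.uniform(0.1, 0.9, size=(n, T))
params = dict(sigma2=1.0, sigma_n2=0.01, lam=0.5)

total = joint_cost(Y, A, X, Gamma, K, **params)
per_task = sum(task_cost(Y[:, t], A, X[:, t], Gamma[:, t], K[:, t], **params)
               for t in range(T))
print(total, per_task)
```

Because the squared Frobenius norm is a sum of squared column norms and the penalty is a sum over entries, the two values agree up to floating-point rounding, so each column can be optimized on its own.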

From now on we discuss only the optimization problem for
class $C_r$ and task $t$. However, it should be solved separately
for each $C_r$, $r = 1, \dots, C$, and each $t$, $t = 1, \dots, T$. For the
sake of simplicity, we drop the class and task indices and use
$y$, $x$ and $\gamma$ instead of $y_t$, $x_t$ and $\gamma_t$, respectively.

Note that the above optimization problem is a hard non-convex
problem. We propose an efficient solution using an MCMC
method. One advantage of introducing a hierarchical model in
(10)-(12) is that we can obtain the marginal posterior
$f(\gamma \mid y) \propto f(y \mid \gamma) f(\gamma)$. According to [26],
$f(\gamma \mid y)$ provides a posterior probability that can be used to
select promising atoms that contribute more to reconstructing $y$.
Moreover, finding the optimum $\gamma^*$ is equivalent to finding
the most prominent atoms in the dictionary A. We therefore
propose the following scheme to solve the optimization
problem: first, we find the dictionary atoms that contribute to
reconstructing the test vector $y$; then we find the value of
the contribution of each atom found in the first step.

We use an MCMC method to extract the promising atoms (see
Algorithm 1). According to [26], a Markov chain of $\gamma^{(j)}$'s as
in (18), obtained by Gibbs sampling, converges in distribution
to $f(\gamma \mid y)$. This sequence can be obtained by Algorithm 1, and
the $\gamma$'s with the highest frequency of appearance correspond
to the prominent atoms in the dictionary A that can be used for
a better reconstruction. Details of sampling from the
distributions in Algorithm 1 can be found in [29].

After Gibbs sampling has revealed the prominent dictionary
atoms for reconstruction, we find the exact contribution of
each dictionary atom by solving the following convex
optimization problem on the promising subset of dictionary
atoms:

$$x^*_\gamma = \arg\min_{x_\gamma} \; \frac{\sigma^2}{\sigma_n^2}\|y - A_\gamma x_\gamma\|_2^2 + \lambda\|x_\gamma\|_2^2, \tag{16}$$

where $A_\gamma$ is the new dictionary consisting of only the
contributing atoms and $x_\gamma$ is the corresponding coefficient
vector for the reduced problem. Afterward, we place the exact
contribution value of each dictionary element into $x$ based on
the indicator variable $\gamma$.

Algorithm 1 MICHS (per class, per task)

Input: $A$, $\kappa_t$, $y$

(1) Find contributing atoms: find the sequence of $\gamma$ and $x$

    $x^{(1)}, \gamma^{(1)}, x^{(2)}, \gamma^{(2)}, x^{(3)}, \gamma^{(3)}, \dots$    (17)

Initialization: iteration counter $j = 1$, $\gamma^{(0)} = (1, \dots, 1)$
and $x$ set to the least-squares estimate.

while $j <$ MaxIter do

    (1) Sample $x^{(j)}$ from $f(x^{(j)} \mid \gamma^{(j-1)})$

    (2) Sample the $i$-th element of $\gamma^{(j)}$, denoted by $\gamma_i^{(j)}$, from
        $f(\gamma_i^{(j)} \mid y, x^{(j)}, \gamma_{(i)}^{(j)})$ for $i = 1, \dots, n$,
        where $\gamma_{(i)}^{(j)}$ denotes the elements of $\gamma$ other than the $i$-th

    (3) $j \leftarrow j + 1$

end while

Extract the sequence of $\gamma$ from (17):

    $\gamma^{(1)}, \gamma^{(2)}, \gamma^{(3)}, \dots$    (18)

Take the most frequent $\gamma$'s in (18) and form $\gamma^*$

(2) Find the values of contribution by (16)

(3) Insert the values in $x^*$ based on $\gamma^*$

Output: $\gamma^*$, $x^*$

Fig. 2. Image acquisition configuration and sample images

By repeating the same procedure for each task and putting the
resulting $x$ and $\gamma$ vectors together, we form the matrices
$X^*_{C_r}$ and $\Gamma^*_{C_r}$ for class $C_r$. Finally, we repeat the whole
process for each class to obtain the corresponding
$(X^*_{C_r}, \Gamma^*_{C_r})$ and perform the classification based on the
value of the cost function or residual, as presented in (14).
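
The two-stage scheme (select atoms by sampling, then refit on the selected atoms) can be sketched in Python as follows. This is a hedged illustration rather than the sampler of the technical report [29]: it swaps in a standard coordinate-wise Gibbs sampler for a Gaussian likelihood with a point-mass spike-and-slab prior, in which each indicator is drawn with its coefficient integrated out and the coefficient is then drawn from its Gaussian conditional. The refit is the closed-form minimizer of a weighted ridge problem of the form $(\sigma^2/\sigma_n^2)\|y - A_\gamma x_\gamma\|_2^2 + \lambda\|x_\gamma\|_2^2$. All function names, the 0.5 frequency threshold, the hyperparameters and the synthetic data are assumptions.

```python
import numpy as np

def gibbs_inclusion_frequency(A, y, kappa, sigma2, sigma_n2, lam,
                              n_iter=300, burn_in=100, seed=0):
    """Estimate how often each atom is active (gamma_i = 1) along the chain."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    x = np.linalg.lstsq(A, y, rcond=None)[0]   # least-squares start, all atoms active
    residual = y - A @ x
    col_sq = np.sum(A ** 2, axis=0)
    v_prior = sigma2 / lam                      # slab variance
    counts = np.zeros(n)
    for it in range(n_iter):
        for i in range(n):
            r = residual + A[:, i] * x[i]       # residual with atom i removed
            v_post = 1.0 / (col_sq[i] / sigma_n2 + 1.0 / v_prior)
            mean = v_post * (A[:, i] @ r) / sigma_n2
            # Log posterior odds of gamma_i = 1 with x_i integrated out.
            log_odds = (np.log(kappa[i]) - np.log1p(-kappa[i])
                        + 0.5 * np.log(v_post / v_prior)
                        + 0.5 * mean ** 2 / v_post)
            p_on = 1.0 / (1.0 + np.exp(-np.clip(log_odds, -50.0, 50.0)))
            active = rng.random() < p_on
            x[i] = rng.normal(mean, np.sqrt(v_post)) if active else 0.0
            residual = r - A[:, i] * x[i]
            if it >= burn_in:
                counts[i] += active
    return counts / (n_iter - burn_in)

def ridge_on_support(A, y, support, sigma2, sigma_n2, lam):
    """Closed-form minimizer of (sigma2/sigma_n2)*||y - A_s x||^2 + lam*||x||^2."""
    A_s = A[:, support]
    c = sigma2 / sigma_n2
    x_s = np.linalg.solve(c * A_s.T @ A_s + lam * np.eye(len(support)), c * A_s.T @ y)
    x = np.zeros(A.shape[1])
    x[support] = x_s                            # place values at the selected atoms
    return x

# Synthetic single-task example (assumed data): 3 active atoms out of 20.
rng = np.random.default_rng(0)
m, n = 40, 20
A = rng.normal(size=(m, n))
A /= np.linalg.norm(A, axis=0)                  # unit-norm atoms
x_true = np.zeros(n)
x_true[[2, 7, 15]] = [3.0, -2.5, 2.0]
sigma2, sigma_n2, lam = 1.0, 0.05 ** 2, 0.25
y = A @ x_true + np.sqrt(sigma_n2) * rng.normal(size=m)

freq = gibbs_inclusion_frequency(A, y, np.full(n, 0.05), sigma2, sigma_n2, lam)
support = np.flatnonzero(freq > 0.5)            # most frequently active atoms
x_star = ridge_on_support(A, y, support, sigma2, sigma_n2, lam)
print("selected atoms:", support)
```

Running the same two functions once per task and per class, and stacking the resulting vectors column by column, would give the per-class matrices used in the final class assignment.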


Table 1. Recognition rate (%) for C = 129 classes

Views (T)   MSM    Graph   SRC    JSRC   JDSRC   MICHS
1           36.5   44.5    45.0   45.0   45.0    51.3
3           48.9   63.4    59.5   53.6   72.0    73.0


4. EXPERIMENTAL RESULTS

Multiview face recognition is a multitask image classification
problem of significant importance, and Zhang et al. have
investigated this problem thoroughly in [12]. To validate the
performance of our proposed Multitask Image Classification
via collaborative Hierarchical Spike-and-slab priors (MICHS),
we present the experimental results of applying MICHS to this
problem and compare them with state-of-the-art algorithms.

CMU Multi-PIE database: We conducted the experiments
on the CMU Multi-PIE face database [31], which contains
a large number of facial images under different illuminations,
viewpoints and expressions, recorded in up to four sessions.
A collection of cameras at different view angles
($\Theta = \{0^\circ, \pm 15^\circ, \pm 30^\circ, \pm 45^\circ, \pm 60^\circ, \pm 75^\circ, \pm 90^\circ\}$)
captures the same scene. Among all the available individuals, we
picked C = 129 subjects that are present in all four sessions
of acquisition. An illustration of the multiple-camera
configuration as well as sample images can be found in Fig. 2.

We follow the procedure described in [12] for conducting the
experiments and build our training dictionary A by picking
training images from session 1 using a subset of the available
views, i.e. $\Theta_{train} = \{0^\circ, \pm 30^\circ, \pm 60^\circ, \pm 90^\circ\}$. Test images
are obtained from all available view angles and from session
2. This is a more realistic scenario, in which not all testing
poses are available in the dictionary. To generate a test
matrix with T views, we randomly pick one of the C subjects and
then randomly pick T different views of that person from
$\Theta$. We generated two thousand such test samples and
compared the performance of MICHS with the classical Mutual
Subspace Method (MSM) [4], the graph-based method in [32],
SRC [11] combined with majority voting, the JSRC method in
[33] and, finally, JDSRC [12].

To show the effectiveness of our algorithm, we first compare
the results for T = 1 (single-task scenario) and then show the
results for T = 3 different views. These results are shown in
Table 1; our method gives the best performance among all
algorithms in both cases. Fig. 3 illustrates the results for a
scenario with a reduced number of classes, in which we consider
only 30 of the 129 subjects. We compare our results with JDSRC,
which was the second-best method. We argued in Section 3 that
by using class-specific priors we can expect less sensitivity
(slower decay) to an insufficient number of training samples.
This is verified in Fig. 3 with a reduced number of training
images per class (TPC = 3, 5, 7).